The purpose of this document is to detail my overall process for exploring a given Case Study, going from accessing the data to finding insights that answer the specified business task.

Business Task

Goal: Investigating the differences between members and casual riders in their usage of Cyclistic bikes to better understand how to convert riders into members.

Processing Data

Where is the data located?

Data has been made available by Motivate International Inc. under this license. For the purpose of this business task, all of the 2022 data and the available 2023 data (up to Q3 currently, 20/04/2023) was used.

According to the license, this data is permitted to be used as ‘source material, as applicable, in analyses, reports, or studies published or distributed for non-commercial purposes’.

In this fictional scenario, the data has been collected by Cyclistic, so it will be assumed to be reliable, original and credible for this buisness task.

Security considerations aren’t so relevant as data is accessible online and free-to-use (but not monetize). Regardless, the data was stored on a single-user accessible and password-protected PC.

How is the data organised?

Initially several CSV files of the same schema were downloaded as ZIP files and extracted and stored in a single directory. Using the below python script, all of the CSV files were merged into a single one.

from os import listdir
from os.path import isfile, join
import pandas as pd
def files(directory):
    onlyfiles = [f for f in listdir(directory) if isfile(join(directory, f))]
    return onlyfiles
def merge_csv_files(loc):
  for p in files(loc):
      if '.csv' in p:
          l = loc + "/" + p
          df2 = pd.read_csv(l)
          df = df2 if df is None else pd.concat([df, df2])

The schema of the merged CSV file

columns

types

n_missing

n_uniques

completion_rate

ride_id

string

0

6,307,141

1.0

rideable_type

string

0

3

1.0

start_station_name

string

921,168

1,705

0.9

start_station_id

string

921,300

1,321

0.9

end_station_name

string

985,758

1,728

0.8

end_station_id

string

985,899

1,326

0.8

member_casual

string

0

2

1.0

start_lat

numeric

0

1.0

start_lng

numeric

0

1.0

end_lat

numeric

6,284

1.0

end_lng

numeric

6,284

1.0

started_at

datetime

0

1.0

ended_at

datetime

0

1.0

Problems with data

The data isn’t entirely complete nor consistent:

  • Missing start_station_id and end_station_id data

    • Should data be filtered to rows with at least one or both?
  • There more unique start_station_name than start_station_id (same for end_station_id)

    • data entry mistakes?

    • name changes over time?

    • do ids represent multiple stations? will location data support this assumption?

  • Missing end_lat and end_lng data

    • Can this be inferred from start_lat and start_lng?

More data integrity issues were discovered by inspecting the data.

Relationships within data

The field to field relationships were investigated for the purpose of normalisation. If any one-to-one relationships are found, the data can be split into multiple tables for more efficient storage (and better data modelling).

How do names relate to station ids?

Discovered the following relationships:

  1. (end_station_id) many<—->many (end_station_name)

  2. (start_station_id) many<—->many (start_station_name)

end_station_id

end_station_name

1032

Public Rack - Kedvale Ave & 63rd St

1032

Public Rack - Kedvale Ave & 63rd St W

1039

Public Rack - Kedzie Ave & 61st Pl

1039

Public Rack - Kedzie Ave & 61st Pl W

13053

Green St & Washington Blvd

13053

Green St & Randolph St

Summary:

  • In simpler terms for both the start and end stations, two different stations by id may have the same name and a single station by id can have multiple names.

  • Further investigation would need to be carried out to determine the cause of these inconsistencies. A prelim search revealed some potential causes: street name changes; spelling mistakes.

  • Attempts at normalisation via extraction of a station table from this data set aren’t feasible prior to this investigation.

  • For now, this report will assume end_station_id and start_station_id to be identifiers for their respective stations.

How do station locations relate to station ids?

  1. (end_station_id) many<—->many (end_lng, end_lat)

  2. (start_station_id) many<—->many (start_lng, start_lat)

end_lat

end_lng

id_count

41.66

-87.55

2

41.68

-87.68

2

41.68

-87.64

2

41.68

-87.63

3

41.68

-87.62

3

41.69

-87.71

4

start_station_id

sd_lat

sd_lng

range_distance_variation

Hubbard Bike-checking (LBS-WH-TEST)

0.0620

0.0436

45,093.84

LF-005

0.0020

0.0018

43,332.07

600

0.0203

0.0091

42,275.41

599

0.0298

0.0144

41,396.61

596

0.0240

0.0049

41,357.62

605

0.0128

0.0025

40,653.37

Summary:

  • A single coordinate (lat, lng) can have multiple station ids associated with it and vice versa.

  • Insertion or rounding errors don’t explain why many stations ids correspond to many different locations.

  • By applying Haversine distances between the minimum and maximum coordinates found for a given station ID, 565 start station ids had a distance of over 1km.

How does the data help to answer the business task?

  • What locations and time do members and casuals concentrate at? Is there something connecting this?
  • What types of bikes do members and casuals typically choose?
  • Use longitudinal and latitude data to determine the total distances members and casuals travel?
  • Look into durations of travels

Cleaning the data

# Check the rideable_type categories
bike_categories <- unique(X2022_202303_divvy_tripdata$rideable_type)
print(bike_categories)
## [1] "electric_bike" "classic_bike"  "docked_bike"
# Check the member_casual categories
member_categories <- unique(X2022_202303_divvy_tripdata$member_casual)
print(member_categories)
## [1] "casual" "member"
# Check the date range
date_range <- c(min(X2022_202303_divvy_tripdata$started_at), max(X2022_202303_divvy_tripdata$ended_at))
print(date_range)
## [1] "2022-01-01 00:00:05 UTC" "2023-04-03 11:41:11 UTC"

What to do about missing data?

  1. Missing start_station_id and end_station_id: will only remove if both are null.
  2. Missing location data: Only an issue for end_lat and end_lng, but due to the data inconsistency between them and end_station_id, they cannot be inferred. For the sake of the report, they will be kept as null.
  3. Missing start_station_name and end_station_name: any missing entries will be kept as null because we are assuming start_station_id and end_station_id as the identifier for stations.
clean_data <- X2022_202303_divvy_tripdata %>% filter(!(is.na(start_station_id) & is.na(end_station_id)))

Adding important metrics

Using the datetime and latitude/longitude fields, a duration and distance could be calculated for each ride respectively.

# Used Haversine distance in library(geosphere)
example_clean_data <- clean_data %>% drop_na(start_lat, end_lng, end_lat, start_lng) %>% head()
h_distances <- apply(example_clean_data[c('start_lng', 'start_lat', 'end_lng', 'end_lat')], 1, function(x) distm(c(x[1], x[2]), c(x[3], x[4]), fun=distHaversine))
example_clean_data <- cbind(example_clean_data, h_distances)

# Calculating duration:
example_clean_data_with_metrics <- example_clean_data %>% mutate(ride_duration_mins=round(as.numeric(ended_at-started_at)/60, 3)) %>% rename(haversine_distance=h_distances)

flextable(example_clean_data_with_metrics %>% select(ride_id, ride_duration_mins, haversine_distance))

ride_id

ride_duration_mins

haversine_distance

D789EFF6D43AB2A1

11.267

1,468.216

9B9FEDC78CCF828F

3.417

0.000

241C440C74CB31BB

9.067

1,280.955

02DA3A10370A3F18

13.233

4,892.140

76C81BC148654648

9.733

1,388.439

53A7590B28ED25E2

0.750

0.000

Analysis

Insights can be found by grouping the rides by member or casuals and summarising based on a variety of different distance and duration and count aggregations for each group. Below show all the aggregations chosen:

basic_summary <- clean_data_with_metrics %>% group_by(member_casual) %>% 
  summarize(number_of_rides=n(), 
            mean_h_distance=mean(haversine_distance, na.rm=TRUE),
            median_h_distance=median(haversine_distance, na.rm=TRUE),
            total_h_distance=sum(haversine_distance, na.rm=TRUE),
            std_h_distance=sd(haversine_distance, na.rm=TRUE),
            mean_duration_mins=mean(ride_duration_mins, na.rm=TRUE),
            median_duration_mins=median(ride_duration_mins, na.rm=TRUE),
            total_duration_mins=sum(ride_duration_mins, na.rm=TRUE), 
            std_duration=sd(ride_duration_mins, na.rm=TRUE),
            number_of_electric=sum(rideable_type=='electric_bike'),
            number_of_classic=sum(rideable_type=='classic_bike'), 
            number_of_docked=sum(rideable_type=='docked_bike'), .groups = "drop")

counts_per_bike_type <- basic_summary  %>%  select(member_casual, number_of_electric, number_of_classic, number_of_docked, number_of_rides)
distances_summary <- basic_summary %>% select(member_casual, total_h_distance, median_h_distance, mean_h_distance, std_h_distance)
durations_summary <- basic_summary %>% select(member_casual, total_duration_mins, median_duration_mins, mean_duration_mins, std_duration)

How do members and casual differ in their chosen type of bikes?

bike_type_proprotions = counts_per_bike_type %>% mutate(electric=100*number_of_electric/number_of_rides, 
                                      classic=100*number_of_classic/number_of_rides,
                                      docked=100*number_of_docked/number_of_rides) %>%
        select(member_casual, electric, classic, docked) %>% 
        gather(key="rideable_type", value="percentage_of_rides", 2:4)

ggplot(data=bike_type_proprotions) + 
  geom_bar(mapping=aes(y=rideable_type, x=percentage_of_rides), stat='identity') +
  facet_wrap(~member_casual) + 
  labs(title="Bike Type Choices",
       y="Type of bike",
       x="Percentage of rides")

  • No members use docked bikes
  • Members use classic more than electric while casuals use electric more than classic

How do members and casual differ in their ride durations?

durations_by_member <- durations_summary %>% mutate(total_duration_hours=as.numeric(total_duration_mins/60)) %>% 
  select(member_casual, median_duration_mins, total_duration_hours, mean_duration_mins, std_duration)
flextable(durations_by_member)

member_casual

median_duration_mins

total_duration_hours

mean_duration_mins

std_duration

casual

13.033

1,135,391.9

30.10545

283.31525

member

8.683

746,042.4

12.51538

29.81876

  • Casual members go for slightly longer (in duration) rides than members.
  • There’s considerably more variation (~a factor of 10) in ride duration for casual riders than members.
distances_by_member <- distances_summary %>% mutate(total_distance_km=as.numeric(total_h_distance/1000),
                             mean_distance_km=as.numeric(mean_h_distance/1000),
                             std_distance_km=as.numeric(std_h_distance/1000),
                             median_distance_km=as.numeric(median_h_distance/1000)) %>% 
  select(member_casual, median_distance_km, total_distance_km, mean_distance_km, std_distance_km)  %>% 
  mutate_if(is.numeric, ~round(., 1))
flextable(distances_by_member)

member_casual

median_distance_km

total_distance_km

mean_distance_km

std_distance_km

casual

1.7

4,932,813

2.2

9.5

member

1.5

7,482,507

2.1

12.9

  • Casual and members travel similar distances.

How do members and casual riders vary by days and months?

By number of rides

  • Both casual and member riders have similar hotspots on Sunday and Saturday from 10am to 6pm
  • Considerably fewer number of rides before 5am on all days for both casual riders and members
  • Both casual and member riders have high number of rides occurring from 5am to 9am and 3pm to 7pm
  • Casual riders ride the most on the weekend, whereas members ride the fewest number of times on the weekend.

  • Greatest number of rides occur during June to September for both members and casuals; July being the highest for casuals, August (with July close behind) for members.
  • For both members and casuals, fewest number of rides happen during December
  • Members ride in march considerably more than casuals do

By duration of rides

  • By comparison to casual riders, members tend to have similar average ride durations through the weekdays and hours of the day.
  • Both member and casual riders have the their highest ride durations on Sunday and Saturday

  • Saturday and Sunday have the highest average ride durations for both Members and Casual riders.

  • No insights can be found here

  • No insights can be found here

  • Highest average ride durations occur for both casual and member riders from May to August
  • November and February have the lowst average ride durations for both casual and member riders

By distance of rides

  • Members and casual have a similar spread of ride distances throughout the days and months
  • Total ride distance heatmap is almost identical to total ride count heatmap over week days and hours

  • Similar distributions of ride distances (mean and median) over the hours

  • Highest average distances occur for both member and casuals on Saturday and Sundays

  • No insights can be found here

  • Highest average ride durations occur for both casual and member riders from May to August

How do members and casuals vary by location?

# Finding out where the stations are located

states <- map_data('state')

data_by_end_location <- data_by_end_location %>% filter(end_lng < 0)

ggplot() +
  geom_polygon(data=states, mapping=aes(x=long, y=lat, group=group)) +
  geom_text(data = states %>% group_by(region) %>% summarize(long=mean(long), lat=mean(lat)), mapping=aes(x=long, y=lat, group=region, label=region)) + 
  geom_point(data=data_by_end_location, mapping=aes(x = end_lng, y = end_lat), col = "red")

illinois <- states %>% filter(is.element(region, c('illinois')))

ggplot() +
  geom_polygon(illinois, mapping=aes(x=long, y=lat, group=group)) +
  geom_point(data=data_by_end_location, mapping=aes(x = end_lng, y = end_lat), col = "red")

The stations congregate in the North Eastern region of Illinois, specically within and around the Chicago area.

Loading a shapefile of Chicago grouped by community:

chicago_sf <- st_read(sf_loc)
## Reading layer `geo_export_ec6e177b-af8b-4325-b6b1-922000facde1' from data source `/Users/kimathinyota/CyclisticAnalysis/SHAPES/Chicago_community_areas/geo_export_ec6e177b-af8b-4325-b6b1-922000facde1.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 77 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS84(DD)
ggplot(data=chicago_sf) +
  geom_sf()

Each location, start or end, were grouped by their given Chicago community using the Chicago shape file. A subset of this aggregated data by community is shown below.

community

total_members

total_casuals

total_classic

total_docked

total_riders

total_starts

total_ends

mid_latitude

mid_longitude

ALBANY PARK

14,783

9,749

7,800

723

24,532

5,834

13

41.96539

-87.71158

ARCHER HEIGHTS

1,303

1,495

0

0

2,798

96

5

41.80362

-87.72482

ARMOUR SQUARE

45,444

25,317

36,965

3,454

70,761

11,501

16

41.84575

-87.63337

ASHBURN

242

142

0

0

384

5

15

41.74600

-87.70411

AUBURN GRESHAM

1,182

1,426

824

61

2,608

483

19

41.74284

-87.65254

AUSTIN

5,373

9,052

3,090

387

14,425

2,307

39

41.88648

-87.76170

chicago_sf_with_stats <- chicago_sf_with_stats %>% mutate(member_percentage=round(100*total_members/total_riders,3), total_riders_per_thousand=total_riders/1000)

ggplot(data=chicago_sf_with_stats) +
  geom_sf(aes(fill=member_percentage)) +
  geom_point(aes(x = mid_longitude, y = mid_latitude, size=total_riders_per_thousand), alpha=0.2)  + 
  labs(title="Station locations in Chicago",
       fill="Percentage of members (%)",
       size="Total number of riders (in thousands)",
       x="Longitude",
       y="Latitude",
       alpha=NA)

Summary